System and method for computerized semantic processing of electronic documents including themes

ABSTRACT

System and method for computerized identification of themes in a large data set, the system comprising reducing the number of data set members in a large data set, using at least one computerized data set member pruning technique other than random selection; and using a computerized theme identification technique for identifying a plurality of themes in the reduced data set.

Priority is claimed from U.S. Provisional Patent Application No. 61/755,242, entitled “Computerized systems and methods for use of themes in e-discovery” and filed Jan. 22, 2013, the entire contents of which being hereby incorporated herein by reference.

FIELD OF THIS DISCLOSURE

The present invention relates generally to computerized processing of electronic documents and more particularly to computerized semantic processing of electronic documents.

BACKGROUND FOR THIS DISCLOSURE

Wikipedia on “Data_deduplication” states that “In computing, data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data. Related and somewhat synonymous terms are intelligent (data) compression and single-instance (data) storage . . . . For example a typical email system might contain 100 instances of the same 1 MB (megabyte) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB storage space. With data deduplication, only one instance of the attachment is actually stored; the subsequent instances are referenced back to the saved copy for deduplication ratio of roughly 100 to 1.”

Also according to Wikipedia, the term deduplication may refer to Data deduplication, as above, or to Record linkage, in databases, i.e. finding entries that refer to the same entity in two or more files. ‘DeDuping’ may involve removing duplicates in Customer and Address records in a Database or Spreadsheet.

The importance of topic modeling for browsing is known, e.g. at the following http-www-linked publication: cs.princeton.edu/˜blei/topicmodeling.html.

It is also known that “Clustering can be used to assist browsing. Browsing tools complement search tools” e.g. as described at the following http-linked publication: pages.cs.wisc.edu/˜pradheep/Clust-LDA.pdf.

Other state of the art related technologies are described inter alia in:

-   1. Papadimitriou, Christos; Raghavan, Prabhakar; Tamaki, Hisao; Vempala, Santosh (1998). “Latent Semantic Indexing: A probabilistic analysis” (Postscript). Proceedings of ACM PODS. http://www.cs.berkeley.edu/˜christos/ir.ps.
-   2. Hofmann, Thomas (1999). “Probabilistic Latent Semantic Indexing” (PDF). Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval. http://www.cs.brown.edu/˜th/papers/Hofmann-SIGIR99.pdf.
-   3. Blei, David M.; Ng, Andrew Y.; Jordan, Michael I; Lafferty, John (January 2003). “Latent Dirichlet allocation”. Journal of Machine Learning Research 3: 993-1022. doi:10.1162/jmlr.2003.3.4-5.993. http://jmlr.csail.mit.edu/papers/v3/blei03a.html.
-   4. Blei, David M. (April 2012). “Introduction to Probabilistic Topic Models” (PDF). Comm. ACM 55 (4): 77-84. doi:10.1145/2133806.2133826. http://www.cs.princeton.edu/˜blei/papers/Blei2011.pdf
-   5. Sanjeev Arora; Rong Ge; Ankur Moitra (April 2012). “Learning Topic Models-Going beyond SVD”. arXiv:1204.1956.
-   6. Girolami, Mark; Kaban, A. (2003). “On an Equivalence between PLSI and LDA”. Proceedings of SIGIR 2003. New York: Association for Computing Machinery. ISBN 1-58113-646-3.
-   7. Griffiths, Thomas L.; Steyvers, Mark (Apr. 6, 2004). “Finding scientific topics”. Proceedings of the National Academy of Sciences 101 (Suppl. 1): 5228-5235. doi:10.1073/pnas.0307752101. PMC 387300. PMID 14872004.
-   8. Minka, Thomas; Lafferty, John (2002). “Expectation-propagation for the generative aspect model”. Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence. San Francisco, Calif.: Morgan Kaufmann. ISBN 1-55860-897-4.
-   9. Blei, David M.; Lafferty, John D. (2006). “Correlated topic models”. Advances in Neural Information Processing Systems 18.
-   10. Blei, David M.; Jordan, Michael I.; Griffiths, Thomas L.; Tenenbaum, Joshua B. (2004). “Hierarchical Topic Models and the Nested Chinese Restaurant Process”. Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference. MIT Press. ISBN 0-262-20152-6.
-   11. Quercia, Daniele; Harry Askham, Jon Crowcroft (2012). “TweetLDA: Supervised Topic Classification and Link Prediction in Twitter”. ACM WebSci.
-   12. Li, Fei-Fei; Perona, Pietro. “A Bayesian Hierarchical Model for Learning Natural Scene Categories”. Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR′05) 2: 524-531.
-   13. Wang, Xiaogang; Grimson, Eric (2007). “Spatial Latent Dirichlet Allocation”. Proceedings of Neural Information Processing Systems Conference (NIPS).

Topic modeling (Wikipedia): In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. An early topic model was described by Papadimitriou, Raghavan, Tamaki and Vempala in 1998. [1] Another one, called Probabilistic latent semantic indexing (PLSI), was created by Thomas Hofmann in 1999.[2] Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSI developed by David Blei, Andrew Ng, and Michael Jordan in 2002, allowing documents to have a mixture of topics.[3] Other topic models are generally extensions on LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations which constitute topics. Although topic models were first described and implemented in the context of natural language processing, they have applications in other fields such as bioinformatics.

Topics in LDA (Wikipedia): In LDA, each document may be viewed as a mixture of various topics. This is similar to probabilistic latent semantic analysis (pLSA), except that in LDA the topic distribution is assumed to have a Dirichlet prior. In practice, this results in more reasonable mixtures of topics in a document. It has been noted, however, that the pLSA model is equivalent to the LDA model under a uniform Dirichlet prior distribution.[12]

The disclosures of all publications and patent documents mentioned in the specification, and of the publications and patent documents cited therein directly or indirectly, are hereby incorporated by reference. Materiality of such publications and patent documents to patentability is not conceded.

SUMMARY OF CERTAIN EMBODIMENTS

The following terms may be construed either in accordance with any definition thereof appearing in the prior art literature or in accordance with the specification, or as follows:

Document score: the significance of an individual theme in a particular document. For example, if a topic modeling process defines a topic as a distribution over a fixed vocabulary and assumes that each document includes various topics each with different proportions determined by a per-document distribution over topics then a document's “score” for a particular theme may be the document's level of probability given that document's distribution over a universe of topics; this “score” is typically generated in the course of performing conventional topic-modeling processes. In other words, each topic x has some probability θ_yx of being in document y [Blei, D., Ng, A., Jordan, M. (2003) Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993-1022].

Duplicate types:

-   -   Entirely Exact duplicate: Two documents which have the same bits in the same order.
    -   Text exact duplicate: Two documents whose extracted texts are exact duplicates.
    -   Near-duplicate: Two documents are near-duplicate if the resemblance between them is above a threshold. For example, conventional w-shingling techniques may be employed.
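By way of illustration only, the near-duplicate definition above may be sketched using w-shingling with a Jaccard resemblance. The function names, the shingle width w=3 and the threshold 0.8 are assumptions chosen for the example and are not prescribed herein.

```python
def shingles(text, w=3):
    """Return the set of contiguous w-word shingles of a text (assumed tokenization: whitespace)."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=3):
    """Jaccard resemblance between the shingle sets of two texts."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a, b, threshold=0.8, w=3):
    """Two documents are near-duplicates if their resemblance is above the threshold."""
    return resemblance(a, b, w) > threshold
```

A lower threshold groups more documents together, which yields stronger pruning at the cost of treating less similar documents as redundant.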

File: electronic document

Inclusive: an e-mail whose subject and/or body is not contained in any other e-mail in a given set of emails. This definition implies that if an email is not an inclusive, its subject and/or body is contained in one of the “inclusive” emails defined for that set of emails. Often, an inclusive culminates and includes an entire email thread.
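A minimal sketch of the inclusive definition above follows, under simplifying assumptions: only email bodies are compared, containment is tested as a plain substring test after whitespace and case normalization, and when two bodies are identical the email with the smaller identifier is kept. The function name and the identifiers are hypothetical.

```python
def find_inclusives(bodies):
    """Return ids of emails whose normalized body is not contained in any other email.

    `bodies` maps an email id to its body text (assumed input format).
    """
    norm = {eid: " ".join(text.split()).lower() for eid, text in bodies.items()}
    inclusives = set()
    for eid, text in norm.items():
        contained = any(
            other != eid
            and text in other_text
            # identical bodies: keep only the one with the smallest id
            and (text != other_text or other < eid)
            for other, other_text in norm.items()
        )
        if not contained:
            inclusives.add(eid)
    return inclusives
```

For a thread in which each reply quotes the previous message in full, this sketch typically keeps only the last message of the thread.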

Overlap: Themes are considered related, or similar, if they overlap. “Overlap” may be computed by noting that each document can be assigned to more than one theme. Suppose the set X of documents is in theme T_1, and the set Y of documents is in theme T_2; the overlap between T_1 and T_2 is (size of intersection of X and Y)/(size of X). The definition of overlap need not be symmetrical; the number of documents that belong to both themes may be divided by the size of the theme of interest. For example if one theme (say, “parrots”) is included in another theme (say, “birds”) all documents in the small theme are typically also included in the second theme. Then the overlap from the viewpoint of the small theme is 100% and from the viewpoint of the larger theme is less than 100%.
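The asymmetric overlap defined above may be sketched as follows; the function name and the document identifiers are hypothetical and serve only to reproduce the “parrots”/“birds” example.

```python
def overlap(theme_a, theme_b):
    """Overlap of theme_a with theme_b, from the viewpoint of theme_a.

    Each theme is given as a set of document ids.
    """
    if not theme_a:
        return 0.0  # assumed convention for an empty theme
    return len(theme_a & theme_b) / len(theme_a)


parrots = {"d1", "d2"}
birds = {"d1", "d2", "d3", "d4"}
print(overlap(parrots, birds))  # 1.0, i.e. 100% from the viewpoint of the small theme
print(overlap(birds, parrots))  # 0.5, i.e. less than 100% from the viewpoint of the larger theme
```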

Pivot: a document which is a representative, in some sense, of a set of near-duplicates. Any suitable application-specific policy may be employed to define which document is representative e.g. the document in the set which has the highest/lowest/median number of words.
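One possible pivot-selection policy is sketched below, assuming that each near-duplicate set is given as a mapping from document id to extracted text and that ties are broken by the smallest document id. The function name, the policy labels and the tie-breaking rule are assumptions made for the example.

```python
from statistics import median_low


def select_pivot(nd_set, policy="max"):
    """Pick a representative document id from a set of near-duplicates by word count."""
    counts = {doc_id: len(text.split()) for doc_id, text in nd_set.items()}
    if policy == "max":
        target = max(counts.values())
    elif policy == "min":
        target = min(counts.values())
    elif policy == "median":
        # median_low always returns a word count that actually occurs in the set
        target = median_low(list(counts.values()))
    else:
        raise ValueError("unknown policy: " + policy)
    # assumed tie-break: smallest document id among those with the target count
    return min(doc_id for doc_id, n in counts.items() if n == target)
```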

Equivio Zoom's Relevance functionality is a commercially available software tool that uses “supervised” machine learning, hence there is typically a human expert that trains the system. In the themes functionality described herein there is typically no supervision or, in some versions, topics may be “semi-supervised”. Themes functionality as described herein is useful inter alia in training a system by allowing a human trainer to search for relevancy by browsing through a large collection of electronic documents e.g. as described herein. Since in many cases, finding documents relevant to a specific issue is no simple task, themes functionality described herein may facilitate the process of finding such documents.

Pruning: reducing size of (number and/or size of members in) a data set by removing some members of the set e.g. to achieve a predetermined number/total size of members in the set, including prioritizing removal of set members known to be superfluous e.g. duplicates, vis a vis removal of set members not known to be superfluous which is lower priority.

Similarity/relatedness:

-   -   similar/related documents/files: Various operational definitions (metrics) are possible for similar documents e.g.

(a.) documents A, B respectively belonging to theme set A and theme set B where many themes are common to theme sets A and B, or, more generally, that the distributions of the two documents over all themes are close; or

(b.) The texts of the documents are similar (near-duplicate).

-   -   similar/related themes: various operational definitions (metrics) are possible for “similarity” of themes, such as but not limited to themes which have many/few documents in common, or themes whose names have many/few words in common, or themes which “overlap” to a considerable degree (e.g. over a threshold).

Theme: A set of documents which relate to a single subject; each document may simultaneously relate to several subjects hence be included in several themes. For example, some computerized topic modeling methods yield models in which certain documents have a mixture of topics.

Topic: Typically, “topic” as used herein refers to output by conventional topic modeling whereas “theme” typically refers to that output as further processed in accordance with embodiments of the present invention.

Unique documents: Documents in a collection that do not have any other documents which are near-duplicates.

Word score: the significance of an individual word to an individual theme. For example, if a topic modeling process defines a topic as a distribution over a fixed vocabulary and assumes that each document includes various topics each with different proportions determined by a per-document distribution over topics, then each word's “score” may be its level of probability given the distribution over the fixed vocabulary defined by the topic; this “score” is typically generated in the course of performing conventional topic-modeling processes. In other words, each word x has some probability β_yx of being in theme y [Blei, D., Ng, A., Jordan, M. (2003) Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993-1022].

Certain embodiments of the present invention seek to provide computerized systems and methods for use of themes in e-discovery and other semantic tasks.

Certain embodiments of the present invention seek to provide methods for computerized identification of themes in a large data set.

Certain embodiments of the present invention seek to provide methods for use of multi-topic modeling in e-discovery and other semantic tasks.

Embodiments include:

Embodiment 1

A method for computerized identification of themes in a large data set, the method comprising:

reducing the number of data set members in a large data set, using at least one computerized data set member pruning technique other than random selection; and

using a computerized theme identification technique for identifying a plurality of themes in the reduced data set.

Embodiment 2

A method according to Embodiment 1 wherein the computerized data set member pruning technique comprises thinning out at least one document which passes a document similarity criterion relative to at least one other document not being thinned out, thereby to combat skewing as a result of over-influence of similar, hence over-represented, documents upon the theme identification technique.

Embodiment 3

A method according to Embodiment 2 wherein the thinning out at least one document which passes a document similarity criterion comprises replacing a plurality of emails forming an email thread, with at least one inclusive email, thereby to thin out emails which are included in the inclusive email hence are deemed to pass the document similarity criterion with regard to the inclusive.

Embodiment 4

A method according to Embodiment 2 wherein the thinning out at least one document which passes a document similarity criterion comprises identifying and discarding near-duplicates thereby to thin out at least one document which is deemed to pass the document similarity criterion with regard to a set of near-duplicates of the document, at least one of which is not being thinned out.

Embodiment 5

A method according to Embodiment 1 wherein the computerized theme identification technique comprises topic modeling.

Embodiment 6

A method according to Embodiment 5 wherein the topic modeling allows documents to have a plurality of topics.

According to Wikipedia, “An early topic model was described by Papadimitriou, Raghavan, Tamaki and Vempala in 1998. [1] Another one, called Probabilistic latent semantic indexing (PLSI), was created by Thomas Hofmann in 1999.[2] Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSI developed by David Blei, Andrew Ng, and Michael I. Jordan in 2002, allowing documents to have a mixture of topics.[3] Other topic models are generally extensions on LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations which constitute topics.”

Embodiment 7

A browsing system operative in conjunction with a stored representation of a multiplicity of electronic documents and their distribution over a plurality of themes, the system comprising:

theme-to-document flitting apparatus for retrieving and presenting to a user, documents whose document score for at least one user-selected theme is high; and

document-level browsing apparatus for retrieving and presenting to a user, documents whose distributions over the plurality of themes are similar to the distribution of a user-selected document over the plurality of themes.
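The two retrieval operations of Embodiment 7 may be sketched as follows, assuming that the stored representation is a mapping from document id to that document's distribution over themes. The Hellinger distance is used here merely as one possible measure of closeness between distributions; no particular measure is prescribed herein, and all names are hypothetical.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


def top_documents_for_theme(theme, theta, k=5):
    """Theme-to-document flitting: documents with the highest score for a theme index."""
    return sorted(theta, key=lambda d: theta[d][theme], reverse=True)[:k]


def similar_documents(selected, theta, k=5):
    """Document-level browsing: documents whose theme distribution is closest to the selected one."""
    ref = theta[selected]
    others = [d for d in theta if d != selected]
    others.sort(key=lambda d: hellinger(ref, theta[d]))
    return others[:k]
```

The same two functions may be applied to a word-by-theme representation to obtain the word-level counterparts.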

Embodiment 8

A system according to Embodiment 7 and also comprising:

theme-to-word flitting apparatus for retrieving and presenting to a user, words whose word score for at least one user-selected theme is high;

word-level browsing apparatus for retrieving and presenting to a user, words whose distributions over the plurality of themes are similar to the distribution of a user-selected word over the plurality of themes,

thereby to provide 3-tier browsing apparatus facilitating browsing at word, document and topic levels responsive to user-initiated flitting between the levels.

Embodiment 9

A method according to Embodiment 1 and also comprising:

facilitating theme-to-word flitting by retrieving and presenting to a user, words whose word score for at least one user-selected theme is high.

Embodiment 10

A method according to Embodiment 1 and also comprising:

facilitating theme-to-document flitting for retrieving and presenting to a user, documents whose document score for at least one user-selected theme is high.

Embodiment 11

A method according to Embodiment 1 and also comprising:

facilitating document-level browsing for retrieving and presenting to a user, documents whose distributions over the plurality of themes are similar to the distribution of a user-selected document over the plurality of themes.

Embodiment 12

A method according to Embodiment 1 and also comprising:

facilitating word-level browsing for retrieving and presenting to a user, words whose distributions over the plurality of themes are similar to the distribution of a user-selected word over the plurality of themes.

Embodiment 13

A method according to Embodiment 1 wherein the number of data set members in the large data set is further reduced subsequent to the using step and prior to a manual review process.

Embodiment 14

A method according to Embodiment 1 wherein the reducing is effected using:

random selection; and

at least one computerized data set member pruning technique other than random selection.

Embodiment 15

A method according to Embodiment 14 wherein the random selection is performed after the computerized data set member pruning technique.

Embodiment 16

A method according to Embodiment 14 wherein the random selection is performed before the computerized data set member pruning technique.

Embodiment 17

A method according to Embodiment 5 wherein the topic modeling which allows documents to have a plurality of topics comprises one of the following computerized techniques: Latent Dirichlet allocation (LDA), PLSI, and Pachinko allocation.

Embodiment 18

A method according to Embodiment 3 wherein the thinning out at least one document which passes a document similarity criterion comprises replacing a plurality of emails forming an email thread, with a single inclusive email.

Embodiment 19

A method according to Embodiment 4 wherein the identifying and discarding near-duplicates is effected using Equivio Zoom near-duplicate functionality.

The present invention also typically includes at least the following embodiments:

Embodiment a1

An e-discovery method comprising:

Step 1: Input a set of electronic documents.

Step 2: Extract text from the data collection.

Step 3: Compute Near-duplicate (ND) on the dataset.

Step 4: Compute Email threads (ET) on the dataset.

Step 5: Run a topic modeling on a subset of the dataset, including data manipulation.

Embodiment a2

A method according to Embodiment a1 wherein the output of step 3 includes all documents having the same DuplicateSubsetID having an identical text.

Embodiment a3

A method according to Embodiment a1 wherein the output of step 3 includes all documents x in the set for which there is another document y in the set, such that the similarity between the two is greater than some threshold.

Embodiment a4

A method according to Embodiment a1 wherein the output of step 3 includes at least one pivot document selected by a policy such as maximum words in the document.

Embodiment a5

A method according to Embodiment a1 wherein the subset includes inclusives of Email threads (ET).

Embodiment a6

A method according to Embodiment a1 wherein the subset includes Pivots from documents and attachments, but not emails.

Embodiment a7

A method according to Embodiment a1 wherein the data manipulation includes, if the document is an e-mail, removing all e-mail headers in the document, but keeping the subject line and the body of the e-mail.

Embodiment a8

A method according to Embodiment a7 and also comprising multiplying (i.e. repeating) the subject line, thereby to assign additional weight to the subject words.
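The e-mail manipulation of Embodiments a7 and a8 may be sketched as follows using the Python standard library `email` package, assuming a simple single-part plain-text message. The function name and the default weight of 3 are assumptions; multipart messages and attachments are deliberately not handled in this sketch.

```python
from email import message_from_string


def email_text_for_modeling(raw, subject_weight=3):
    """Drop all headers except the subject; repeat the subject to give its words more weight."""
    msg = message_from_string(raw)
    subject = msg.get("Subject", "")
    # assumption: only single-part messages carry a usable text body here
    body = msg.get_payload() if not msg.is_multipart() else ""
    return " ".join([subject] * subject_weight + [body.strip()])
```

Repeating the subject line is a simple way to raise the term counts of subject words in a bag-of-words representation without changing the topic modeling technique itself.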

Embodiment a9

A method according to Embodiment a1 wherein the data manipulation includes Tokenization of the text using separators.

Embodiment a10

A method according to Embodiment a1 wherein the data manipulation includes ignoring the following features:

Words with length less than (parameter)

Words with length greater than (parameter)

Words that do not start with an alpha character.

(Optionally) words that contain digits.

(Optionally) words that contain non-AlphaNumeric characters, optionally excluding some subset of characters such as ‘_’.

Words that are stop words.

Words that appear more than (parameter) times number of words in the document.

Words that appear less than (parameter) times number of documents.

Words that appear more than (parameter) times number of documents.
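The feature filters listed above may be sketched as follows. The separator set, the parameter names and all default values are assumptions made for the example; each “(parameter)” above corresponds to one keyword argument.

```python
import re
from collections import Counter

SEPARATORS = re.compile(r"[\s,.;:!?()\"]+")  # assumed separator set


def tokenize(text):
    """Tokenize lower-cased text using the assumed separators."""
    return [t for t in SEPARATORS.split(text.lower()) if t]


def keep_word(word, stop_words, min_len, max_len, drop_digits=True, drop_non_alnum=True):
    """Per-word filters: length, leading character, digits, non-alphanumerics, stop words."""
    if not (min_len <= len(word) <= max_len):
        return False
    if not word[0].isalpha():
        return False
    if drop_digits and any(c.isdigit() for c in word):
        return False
    if drop_non_alnum and not all(c.isalnum() or c == "_" for c in word):
        return False
    return word not in stop_words


def filter_features(docs, stop_words, min_len=3, max_len=20,
                    max_in_doc=0.5, min_df=0.0, max_df=0.9):
    """Apply per-word, per-document and collection-level frequency filters."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # document frequency: number of documents in which each word appears
    df = Counter(w for toks in tokenized for w in set(toks))
    result = []
    for toks in tokenized:
        tf = Counter(toks)
        result.append([
            w for w in toks
            if keep_word(w, stop_words, min_len, max_len)
            and tf[w] <= max_in_doc * len(toks)
            and min_df * n_docs <= df[w] <= max_df * n_docs
        ])
    return result
```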

Embodiment a11

A method according to Embodiment a1 wherein the output of step 5 includes an assignment of documents to the themes, and an assignment of words (features) to themes and each feature x has some probability P_xy of being in theme y and wherein the P matrix is used to construct names for at least one theme.
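One simple way in which the P matrix might be used to construct a theme name is to concatenate the features having the highest probability for that theme. The sketch below assumes that P is stored as a mapping from feature x to a list of probabilities indexed by theme y; the function name and the number of words per name are hypothetical.

```python
def name_theme(p, theme, n_words=3):
    """Name a theme by its n_words most probable features, in descending order of P_xy."""
    ranked = sorted(p, key=lambda word: p[word][theme], reverse=True)
    return ", ".join(ranked[:n_words])


P = {
    "parrot": [0.5, 0.0],
    "bird": [0.3, 0.1],
    "budget": [0.0, 0.6],
    "invoice": [0.2, 0.3],
}
print(name_theme(P, 0, n_words=2))  # parrot, bird
print(name_theme(P, 1, n_words=2))  # budget, invoice
```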

Embodiment a12

A method for computerized Early Case Assessment comprising:

a. Select at random a set of documents

b. Run near-duplicates (ND)

c. Run Email threads (ET)

d. Select pivot and inclusive

e. Run topic modeling;

g. Generate theme names; and

h. Explore the data by browsing themes.

Embodiment a13

A method according to Embodiment a1 wherein, for Post Case Assessment, rather than using an entire dataset, only the documents that are relevant to the case are used.

Embodiment a14

A method according to Embodiment a1 and also comprising displaying for each theme the list of documents that are related to that theme.

Embodiment a15

A method according to Embodiment a14 wherein the user has an option to select a meta-data and the system will display for each theme the percentage of that meta-data in that theme.
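The per-theme meta-data display of Embodiment a15 may be sketched as follows, under one possible reading: for a user-selected meta-data value (e.g. a custodian), the percentage of documents in each theme that carry that value. The function name and the data layout are assumptions.

```python
def metadata_percentage(themes, metadata, value):
    """For each theme, the percentage of its documents whose meta-data equals `value`.

    `themes` maps a theme name to a set of document ids;
    `metadata` maps a document id to its meta-data value (assumed layout).
    """
    result = {}
    for theme, doc_ids in themes.items():
        hits = sum(1 for d in doc_ids if metadata.get(d) == value)
        result[theme] = 100.0 * hits / len(doc_ids) if doc_ids else 0.0
    return result
```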

Also provided, excluding signals, is a computer program comprising computer program code means for performing any of the methods shown and described herein when the program is run on a computer; and a computer program product, comprising a typically non-transitory computer-usable or -readable medium e.g. non-transitory computer-usable or -readable storage medium, typically tangible, having a computer readable program code embodied therein, the computer readable program code adapted to be executed to implement any or all of the methods shown and described herein. It is appreciated that any or all of the computational steps shown and described herein may be computer-implemented. The operations in accordance with the teachings herein may be performed by a computer specially constructed for the desired purposes or by a general purpose computer specially configured for the desired purpose by a computer program stored in a typically non-transitory computer readable storage medium. The term “non-transitory” is used herein to exclude transitory, propagating signals or waves, but to otherwise include any volatile or non-volatile computer memory technology suitable to the application.

Any suitable processor, display and input means may be used to process, display e.g. on a computer screen or other computer output device, store, and accept information such as information used by or generated by any of the methods and apparatus shown and described herein; the above processor, display and input means including computer programs, in accordance with some or all of the embodiments of the present invention. Any or all functionalities of the invention shown and described herein, such as but not limited to steps of flowcharts, may be performed by a conventional personal computer processor, workstation or other programmable device or computer or electronic computing device or processor, either general-purpose or specifically constructed, used for processing; a computer display screen and/or printer and/or speaker for displaying; machine-readable memory such as optical disks, CDROMs, DVDs, BluRays, magnetic-optical discs or other discs; RAMs, ROMs, EPROMs, EEPROMs, magnetic or optical or other cards, for storing, and keyboard or mouse for accepting. The term “process” as used above is intended to include any type of computation or manipulation or transformation of data represented as physical, e.g. electronic, phenomena which may occur or reside e.g. within registers and/or memories of a computer or processor. The term processor includes a single processing unit or a plurality of distributed or remote such units.

The above devices may communicate via any conventional wired or wireless digital communication means, e.g. via a wired or cellular telephone network or a computer network such as the Internet.

The apparatus of the present invention may include, according to certain embodiments of the invention, machine readable memory containing or otherwise storing a program of instructions which, when executed by the machine, implements some or all of the apparatus, methods, features and functionalities of the invention shown and described herein. Alternatively or in addition, the apparatus of the present invention may include, according to certain embodiments of the invention, a program as above which may be written in any conventional programming language, and optionally a machine for executing the program such as but not limited to a general purpose computer which may optionally be configured or activated in accordance with the teachings of the present invention. Any of the teachings incorporated herein may wherever suitable operate on signals representative of physical objects or substances.

The embodiments referred to above, and other embodiments, are described in detail in the next section.

Any trademark occurring in the text or drawings is the property of its owner and occurs herein merely to explain or illustrate one example of how an embodiment of the invention may be implemented.

Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the specification discussions, utilizing terms such as, “processing”, “computing”, “estimating”, “selecting”, “ranking”, “grading”, “calculating”, “determining”, “generating”, “reassessing”, “classifying”, “generating”, “producing”, “stereo-matching”, “registering”, “detecting”, “associating”, “superimposing”, “obtaining” or the like, refer to the action and/or processes of a computer or computing system, or processor or similar electronic computing device, that manipulate and/or transform data represented as physical, such as electronic, quantities within the computing system's registers and/or memories, into other data similarly represented as physical quantities within the computing system's memories, registers or other such information storage, transmission or display devices. The term “computer” should be broadly construed to cover any kind of electronic device with data processing capabilities, including, by way of non-limiting example, personal computers, servers, computing system, communication devices, processors (e.g. digital signal processor (DSP), microcontrollers, field programmable gate array (FPGA), application specific integrated circuit (ASIC), etc.) and other electronic computing devices.

The present invention may be described, merely for clarity, in terms of terminology specific to particular programming languages, operating systems, browsers, system versions, individual products, and the like. It will be appreciated that this terminology is intended to convey general principles of operation clearly and briefly, by way of example, and is not intended to limit the scope of the invention to any particular programming language, operating system, browser, system version, or individual product.

Elements separately listed herein need not be distinct components and alternatively may be the same structure.

Any suitable input device, such as but not limited to a sensor, may be used to generate or otherwise provide information received by the apparatus and methods shown and described herein. Any suitable output device or display may be used to display or output information generated by the apparatus and methods shown and described herein. Any suitable processor may be employed to compute or generate information as described herein e.g. by providing one or more modules in the processor to perform functionalities described herein. Any suitable computerized data storage e.g. computer memory may be used to store information received by or generated by the systems shown and described herein. Functionalities shown and described herein may be divided between a server computer and a plurality of client computers. These or any other computerized components shown and described herein may communicate between themselves via a suitable computer network.

BRIEF DESCRIPTION OF THE DRAWINGS

Certain embodiments of the present invention are illustrated in the following drawings:

FIG. 1 is a simplified flowchart illustration of a method for use of themes in e-discovery, according to certain embodiments.

FIG. 2 is a simplified flowchart illustration of a method for early case assessment, according to certain embodiments.

FIGS. 3a-3b, taken together, are a simplified flowchart illustration of a method for associating topics with documents, according to certain embodiments.

FIG. 4 is a simplified flowchart illustration of a “navigating” or browsing method for generating suitable displays to facilitate computer-aided theme exploration, suitable e.g. for implementing step 100 in FIGS. 3a-3b, taken together, according to certain embodiments.

FIG. 5 is a simplified screenshot illustration of an example display screen generated by a system constructed and operative in accordance with certain embodiments. The screen display facilitates theme-level browsing, according to certain embodiments.

FIG. 6 is a simplified screenshot illustration of an example display screen generated by a system constructed and operative in accordance with certain embodiments. As shown, flitting from document-level to theme-level or word-level is facilitated.

FIG. 7 is a simplified screenshot illustration of an example display screen generated by a system constructed and operative in accordance with certain embodiments. As shown, document-level browsing is facilitated.

FIG. 8 is a simplified flowchart illustration, according to certain embodiments, of a method for utilizing computerized themes functionality under these circumstances.

The methods of the flowchart figures each include some or all of the illustrated steps, suitably ordered e.g. as shown.

Computational components described and illustrated herein can be implemented in various forms, for example, as hardware circuits such as but not limited to custom VLSI circuits or gate arrays or programmable hardware devices such as but not limited to FPGAs, or as software program code stored on at least one tangible or intangible computer readable medium and executable by at least one processor, or any suitable combination thereof. A specific functional component may be formed by one particular sequence of software code, or by a plurality of such, which collectively act or behave as described herein with reference to the functional component in question. For example, the component may be distributed over several code sequences such as but not limited to objects, procedures, functions, routines and programs and may originate from several computer files which typically operate synergistically.

Data can be stored on one or more tangible or intangible computer readable media stored at one or more different locations, different network nodes or different storage devices at a single node or location.

It is appreciated that any computer data storage technology, including any type of storage or memory and any type of computer components and recording media that retain digital data used for computing for an interval of time, and any type of information retention technology, may be used to store the various data provided and employed herein. Suitable computer data storage or information retention apparatus may include apparatus which is primary, secondary, tertiary or off-line; which is of any type or level or amount or category of volatility, differentiation, mutability, accessibility, addressability, capacity, performance and energy use; and which is based on any suitable technologies such as semiconductor, magnetic, optical, paper and others.

DETAILED DESCRIPTION OF CERTAIN EMBODIMENTS

FIGS. 3 a-3 b, taken together, are a simplified flowchart illustration of a method for associating topics with documents, according to certain embodiments. The method of FIGS. 3 a-3 b typically includes some or all of the following steps, suitably ordered e.g. as shown:

10: Provide a collection of thousands or millions of electronic documents (D) e.g. including a mixture of one or more of:

non-emails

e-mails with attachments

e-mails without attachments

20: Run near-duplicate identifying functionality on the collection, thereby to identify all sets of near-duplicates in the collection

30: Run email thread identifying functionality on all emails in the collection, thereby to identify all email threads in the collection

40: Perform one, some or all of the following steps to pare down the collection of documents (D), thereby to yield a pared-down collection (Z):

40 a. Select one (say) document (e.g. the pivot document) to represent each set of near-duplicates, thereby to yield a set X1 of documents.

40 b. Select (only) the inclusive email or emails to represent each email thread, thereby to yield a set X2 of inclusive emails.

40 c. From the set X2 of inclusive emails, select one (say) document from each near-duplicate set, thereby to yield a set X3; e.g. first select all inclusives, then take only one inclusive from each set of “similar” inclusives (e.g. sets defined as near-duplicates by Equivio Zoom near-duplicate functionality).

50: If the number of documents in Z exceeds a threshold, use random selection to reduce the number of documents in Z to below the threshold

60: select a suitable number, N, of themes to be identified

70: perform topic modeling using documents in Z, thereby to yield N themes

80: Apply the topic model generated in step 70 to dataset D, thereby to yield topics wherein documents may belong to more than one topic; use topics as themes

90: Assign names to the themes. Each word in the set of all words in all documents has some probability to be in a theme; this probability may comprise the “word score”. Typically, the M (predetermined integer e.g. 5) top scoring words are selected to represent the theme i.e. to constitute the theme's name. According to certain embodiments, a name may comprise one or more of the words most frequently found in the documents pertaining to the topic and less frequently or infrequently found in documents not pertaining to the topic.

100: Generate displays (e.g. as per FIG. 4) to facilitate computer-aided exploration of (browsing between) themes and the documents and/or words they include, where themes are represented in the displays by the theme names selected in step 90.
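By way of illustration only, the naming of step 90 might be sketched as follows. The function name `name_themes`, the dictionary layout and the example word scores are assumptions made for this sketch and are not taken from any particular topic modeling package.

```python
from typing import Dict, List


def name_themes(word_scores: Dict[str, Dict[str, float]], m: int = 5) -> Dict[str, List[str]]:
    """Name each theme by its m top-scoring words (hypothetical helper)."""
    names: Dict[str, List[str]] = {}
    for theme, scores in word_scores.items():
        # Highest score first; ties are broken alphabetically so the result is stable.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        names[theme] = [word for word, _ in ranked[:m]]
    return names


# Made-up word scores for two themes, for illustration only.
example_scores = {
    "theme_0": {"milk": 0.30, "meow": 0.25, "kitten": 0.20, "the": 0.05},
    "theme_1": {"puppy": 0.35, "bark": 0.30, "bone": 0.15, "the": 0.05},
}
example_names = name_themes(example_scores, m=2)
```

A word such as “the” scores equally in both example themes and so never reaches the top of either list; in practice such words would typically be removed earlier as stop words.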

Step 40 a may be performed only on non-emails or may be performed on all documents, e-mails and non-emails alike (e.g. e-mails are considered documents).

Step 40 b is typically performed on e-mail bodies i.e. without their attachments.

Step 40 c is typically performed on e-mails without attachments. After identifying inclusives, near-duplicate identification is applied to these and typically, just one or just a few e-mail/s from each group of text-similar e-mails is/are selected. For example: if an email thread has several inclusives, only one of them might be selected.

Typically, random step 50 is performed after near-duplicate and inclusive steps 20, 30 and 40, to enable a user to ascertain whether further random pruning is necessary, since it is possible that steps 20, 30, 40 reduce the size of the data set sufficiently without requiring any random pruning. However, alternatively or in addition, random pruning may occur before steps 20, 30, and 40.

Typically, random step 50 is performed only when it is desired to reduce processing time whereas for a small set of documents, e.g. fewer than 400 thousand documents, step 50 may be omitted. Optionally, the system computes cost (monetary or in terms of time) of topic modeling both with random selection and without. The system may for example compute the time or cost to compute a topic model on a random sample which is, say, 50%/10%/1% the size of the original data set.
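The paring-down of steps 40 and 50 might, purely as a sketch, look as follows. The `Doc` record, its field names and the choice of the lowest `doc_id` as representative are hypothetical; the near-duplicate set identifiers and inclusive flags are assumed to have been produced already by steps 20 and 30.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Doc:
    doc_id: int
    is_email: bool
    near_dup_set: int           # identifier assumed to come from step 20
    is_inclusive: bool = False  # flag assumed to come from step 30 (emails only)


def pare_down(docs: List[Doc], threshold: int, seed: int = 0) -> List[Doc]:
    # Step 40 b: among e-mails keep only inclusives; non-emails all stay candidates.
    candidates = [d for d in docs if not d.is_email or d.is_inclusive]
    # Steps 40 a / 40 c: one representative per near-duplicate set
    # (here simply the lowest doc_id, an arbitrary illustrative policy).
    representatives: Dict[int, Doc] = {}
    for d in sorted(candidates, key=lambda d: d.doc_id):
        representatives.setdefault(d.near_dup_set, d)
    z = list(representatives.values())
    # Step 50: random selection only if the pared-down set is still too large
    # (this sketch keeps at most `threshold` documents).
    if len(z) > threshold:
        z = random.Random(seed).sample(z, threshold)
    return z
```

Because the random step runs last, it is skipped whenever the deterministic pruning alone brings the set under the threshold.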

FIG. 4 is a “Navigating” or browsing method for generating suitable displays to facilitate computer-aided theme exploration, suitable e.g. for implementing step 100 in FIGS. 3 a-3 b, taken together. Alternatively, the method of FIG. 4 may be employed to facilitate computer-aided exploration of any set of themes, which need not have been generated using any or all of steps 10-90 in FIGS. 3 a-3 b, taken together. The method of FIG. 4 may include some or all of the following steps, suitably ordered e.g. as shown:

410: Receive e.g. from user, a theme attribute by which to sort themes, e.g.

-   Number of documents in theme
-   Document score-related attribute e.g. theme's average or median or mode document score
-   How many times the theme has been accessed in the past, using stored history of user/group of users
-   Theme name (can be sorted in alphabetical order)
-   % (richness) or absolute number of documents belonging to theme which match a predicate (e.g. are relevant to a predicate, e.g. using Equivio relevance software tool). A predicate is a logical combination of conditions that the documents must satisfy. Examples of conditions: specific document-types, specific languages, above/below a relevance score generated e.g. by Equivio Zoom's relevance functionality. A predicate may be user-selected e.g. via a suitable GUI.

420: Sort themes by a default or user-selected (in step 410) theme attribute and display themes in order determined by sort process OR display only themes which match a criterion (example criteria: more than 85% of documents in theme are relevant to user-selected predicate, theme name includes “Kennedy”, theme includes more than 1000 documents).

430: display theme attribute, in association with displayed theme e.g. how many documents belonging to theme match a predicate (e.g. are relevant to a predicate, e.g. using Equivio relevance software tool).

440: responsive to a user's selection of (e.g. clicking on a displayed) theme, identify themes which are similar to the user-selected theme by identifying themes which have many (number>threshold) documents in common with the user-selected theme.

450: responsive to a user's selection of (e.g. clicking on a displayed) theme, sort the documents in the theme by a default or user-selected document attribute, and display the documents in the theme in the order determined by the sort process. A document attribute may include metadata (Custodian, date) or theme related data (e.g. relevance of document to selected predicate, e.g. using Equivio relevance software tool).

460: responsive to a user's selection of (e.g. clicking on a displayed) document, select and display files whose distributions, e.g. rank distributions, over topics are similar to the selected document's distribution, e.g. rank distribution, over topics. For example, take the vector of scores of the selected document over all themes e.g., for 5 themes, (0.4, 0.01, 0.7, 0, 0); then display all documents whose distance from the above is less than a constant. Any suitable distance metric or function may be employed such as but not limited to Euclidean distance, L-infinity distance (max entry distance), and L-1 distance (also known as Manhattan distance).

470: responsive to a user's selection of a document attribute (e.g. metadata (Custodian, date)), compute distribution and display (e.g. as histogram): number (or %) of documents under (say) custodian C or date D belonging to each theme.
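As a minimal sketch of steps 420 and 440, assuming each theme is held as a set of document identifiers and the predicate has already been evaluated into a set of matching documents, themes might be sorted by richness and compared by shared documents as follows. All names below are hypothetical.

```python
from typing import Dict, List, Set


def richness(theme_docs: Set[int], matching: Set[int]) -> float:
    """Fraction of a theme's documents that satisfy the predicate."""
    return len(theme_docs & matching) / len(theme_docs) if theme_docs else 0.0


def sort_themes_by_richness(themes: Dict[str, Set[int]], matching: Set[int]) -> List[str]:
    # Step 420: richest theme first; theme name breaks ties.
    return sorted(themes, key=lambda t: (-richness(themes[t], matching), t))


def similar_themes(themes: Dict[str, Set[int]], selected: str, threshold: int) -> List[str]:
    # Step 440: themes sharing more than `threshold` documents with the selected theme.
    base = themes[selected]
    return [t for t in sorted(themes) if t != selected and len(themes[t] & base) > threshold]


# Illustrative data: three themes and the documents matching some predicate.
themes = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {6, 7}}
matching = {3, 4, 5, 6}
ordered = sort_themes_by_richness(themes, matching)
near_a = similar_themes(themes, "a", threshold=1)
```

Filtering by a criterion, e.g. richness above 85%, would reuse `richness` in a list comprehension instead of a sort key.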

The Themes functionality herein is particularly useful for identifying relevant documents in a large collection of electronic documents which is sparse in that only a small number of documents are relevant to a particular issue. This is especially the case if it is not possible to identify keywords which can be used to tag relevant documents on the basis of a simple keyword search.

Computerized systems for identifying relevant documents in a large collection of electronic documents exist, such as Equivio Zoom's Relevance functionality.

However, for a sparse document set, it is sometimes necessary to seed the initial training with pre-identified relevant documents, rather than randomly selecting a training set which might include few or no relevant documents. For example, the current Equivio Zoom user guide describes (in section 6.3, from page 58 onward) a process of Adding Seed Files to an Issue.

FIG. 5 is a simplified screenshot illustration of an example display screen generated by a system constructed and operative in accordance with certain embodiments. As shown, each theme is presented together with a bar (right side of screen) indicating relevance to a user-selected issue. As shown, the order of presentation of the themes is in accordance with the length of the bar. For example, the bars may indicate the number of files per theme that are Relevant (e.g. as determined manually, or automatically e.g. by Equivio Zoom's Relevance functionality) to a user-selected issue which may if desired be shown on, and selected via, a suitable GUI (not shown). The screen display facilitates theme-level browsing, according to certain embodiments.

FIG. 6 is a simplified screenshot illustration of an example display screen generated by a system constructed and operative in accordance with certain embodiments. As shown, documents are presented along with (on the left) words in the documents and their word scores, relative to an individual theme, as well as themes related to the individual theme. The words and themes may be presented in descending order of their word scores and relatedness to the individual theme, respectively. If a related theme or word is selected (e.g. clicked upon), a different “semantic view” is generated; for example, of all documents in the selected related theme, using a screen format which may be similar to that of FIG. 6. As shown, flitting from document-level to theme-level or word-level is facilitated.

FIG. 7 is a simplified screenshot illustration of an example display screen generated by a system constructed and operative in accordance with certain embodiments. As shown, document-level browsing is facilitated.

FIG. 8 is a simplified flowchart illustration, according to certain embodiments, of a method for utilizing computerized Themes functionality under these circumstances. The method of FIG. 8 typically includes some or all of the following steps, suitably ordered e.g. as shown:

1010. use computerized Themes functionality to identify an initial “seed” set of (say 5-30) relevant documents in a large sparse collection of electronic documents.

1020. generate a training set of documents including the initial “seed” set of relevant documents and at least an equal number of documents randomly selected from the large sparse collection of electronic documents.

1030. operate computerized relevant document identification system, e.g. Equivio Zoom's Relevance functionality, on the training set, thereby to successfully identify the rare relevant documents in the large sparse collection.
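Step 1020 might be sketched as below, assuming documents are referred to by integer identifiers. The helper name `build_training_set` is hypothetical, and the sketch draws exactly as many random documents as there are seeds, which is the minimum that step 1020 calls for.

```python
import random
from typing import Iterable, List


def build_training_set(seed_ids: Iterable[int], collection_ids: Iterable[int],
                       rng_seed: int = 0) -> List[int]:
    """Combine seed documents with an equal number of randomly drawn documents."""
    seeds = list(dict.fromkeys(seed_ids))  # de-duplicate while keeping order
    seed_set = set(seeds)
    # Draw only from documents that are not already seeds.
    pool = [doc_id for doc_id in collection_ids if doc_id not in seed_set]
    # "At least an equal number": this sketch draws exactly len(seeds),
    # or the whole pool if the collection is smaller than that.
    n_random = min(len(seeds), len(pool))
    return seeds + random.Random(rng_seed).sample(pool, n_random)


training_set = build_training_set([1, 2, 3], range(1, 101))
```

The resulting list would then be handed to whichever relevance system is used in step 1030.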

It is appreciated that step 1010 may be performed in any suitable manner. For example, if at least one relevant document is known, step 1010 may comprise:

a. running the “themes” functionality to obtain a “thematic distribution” for the relevant document e.g. an indication of the significance of each of the various themes to the document. Some topic modeling software provides “document scores” for each document relative to each theme, indicating significance of each of the various themes to each document. Alternatively, if the top key words on the key word list of a theme occur relatively frequently in some documents and relatively infrequently in others, the theme can be regarded as highly significant to the former documents and less significant to the latter documents.

b. selecting documents within the large collection of electronic documents whose “thematic distribution” is similar, using a suitable metric, to the “thematic distribution” of the document known to be relevant. A suitable metric for similarity between Document 1's “thematic distribution” and Document 2's “thematic distribution” may for example be a Euclidean distance (sum of squares-based e.g.) between the document scores of Document 1 and the document scores of Document 2, summed over all themes. Other distance metrics may also be employed e.g. L-infinity distance (max entry distance) and L-1 distance (also known as Manhattan distance).
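The metrics named in step b might be sketched as follows, treating each “thematic distribution” as a tuple of document scores, one per theme. The function names, the example scores and the cut-off value are illustrative assumptions.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def euclidean(p: Vector, q: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p: Vector, q: Vector) -> float:  # L-1 distance
    return sum(abs(a - b) for a, b in zip(p, q))


def l_infinity(p: Vector, q: Vector) -> float:  # max entry distance
    return max(abs(a - b) for a, b in zip(p, q))


def similar_documents(scores: Dict[str, Vector], known_id: str, max_distance: float,
                      metric: Callable[[Vector, Vector], float] = euclidean) -> List[str]:
    """Documents whose thematic distribution lies within max_distance of the known one."""
    target = scores[known_id]
    return sorted(doc for doc, vec in scores.items()
                  if doc != known_id and metric(vec, target) < max_distance)


# Made-up document scores over 5 themes.
scores = {
    "doc_a": (0.4, 0.01, 0.7, 0.0, 0.0),
    "doc_b": (0.5, 0.0, 0.6, 0.0, 0.0),
    "doc_c": (0.0, 0.9, 0.0, 0.1, 0.0),
}
close_to_a = similar_documents(scores, "doc_a", max_distance=0.5)
```

Swapping `metric=manhattan` or `metric=l_infinity` changes only how closeness is measured; the selection logic is unchanged.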

It is appreciated that computerized processing tends to generate clusters (and topics) that are artifactual. For example, presence of the word “weekend” might trigger definition of a cluster of documents which, upon inspection, would be found to include a mass of emails about an unrelated variety of subjects united only by the fact that the emails were written on a Friday hence include an exhortation to “have a nice weekend”. In multi-topic processing (e.g. topic modeling in which one document can be assigned to several topics), this is of less relevance: of the many topics found, some are safely ignored as artifactual and the system as a whole remains workable. In clustering (in which each document can belong to only one topic) however, important documents can be assigned to an artifactual cluster and thereby effectively disappear since disregarding the artifactual cluster tends to lead to disregarding documents assigned thereto.

It is appreciated that the systems and methods shown and described herein enable a 3-tier browsing system to be generated, in which a user can browse at the word, document/file or topic level, and can move from one level to another. For example, a user may look at a presentation of topics, arranged say by relevance to an issue, and the system may present to her or him words or documents whose scores for the theme/s the user has selected are high. The system may for example compute word scores or document scores for all words or documents, sort the words or documents, and present to the user only those whose word or document scores are high. The user may then select one of those words or documents, thereby browsing to a different level. When s/he does select, say, a document scoring high for the topic s/he previously was viewing, the system then shows the document, and also identifies and displays indications of themes to which the document is strongly related, and words whose word scores are high for the themes to which the document is strongly related. The system may do this by computing the degree of relatedness of the document to all themes (each of the themes), sorting the themes on this basis, and presenting to the user only those themes for which the document's degree of relatedness is high. Again the user can change level, from the document level up to the topic level or down to the word level, or the user may continue to browse at the document level, e.g. to documents whose distribution over the identified themes is similar (using a suitable distance metric) to the distribution over the identified themes of the document of previous interest. To support this, the system may compute all documents' distributions over all themes identified, and may also compute the distances between these distributions, either in advance for all document pairs, or in real time for a user-designated document. The system may then present, responsive to a user request for documents similar to document D, the top few documents from a list of documents sorted in accordance with the documents' respective distances from Document D. Alternatively, a user may perform “word-level” browsing by moving from one word to another word which has a similar distribution over N identified topics. To support this, the system may compute all words' distributions over all themes identified, and may also compute the distances between these distributions, either in advance for all word pairs, or in real time for a user-designated word. The system may then present, responsive to a user request for words similar to an individual word W of interest, the top few words from a list of words sorted in accordance with the words' respective distances from Word W.

Another embodiment of the invention, e.g. as described above with reference to FIG. 4, is a browsing system operative in conjunction with a stored representation of a multiplicity of electronic documents and their distribution over a plurality of themes, the system comprising some or all of the following:

theme-to-word flitting apparatus for retrieving and presenting to a user, words whose word score for at least one user-selected theme is high;

theme-to-document flitting apparatus for retrieving and presenting to a user, documents whose document score for at least one user-selected theme is high;

document-level browsing apparatus for retrieving and presenting to a user, documents whose distributions over the plurality of themes are similar to the distribution of a user-selected document over the plurality of themes; and

word-level browsing apparatus for retrieving and presenting to a user, words whose distributions over the plurality of themes are similar to the distribution of a user-selected word over the plurality of themes,

thereby to provide 2- or 3-tier browsing apparatus facilitating browsing at word, document and topic levels responsive to user-initiated flitting between the levels.

It is appreciated that any suitable parameters and work-processes may be employed. For example, a set of electronic documents comprising thousands, tens or hundreds of thousands, or millions of electronic documents may be processed as described herein.

Typically, the number of themes to identify is selected by a user and any suitable number of themes may be requested by the user such as 10, 20, 50, 100, 200 or 500 themes. For example, the number of themes selected may be, perhaps, 200 themes for a collection of a few hundred thousand electronic documents, and proportionally more or fewer themes if the number of documents in the collection is proportionally larger or smaller.

Any suitable “view” of themes may be provided, such as themes sorted by number of files or meta-data attributes of the files, themes sorted by various attributes of the words in the theme name, themes sorted by relevance to an issue and so forth.

A particular advantage of certain embodiments is that documents which are known to be mutually similar or near duplicates are “thinned” so that they do not over-influence or skew the topic modeling process.

It is appreciated that thinning need not result in retaining only a single pivot or only a single inclusive email. Instead one may, if appropriate, reduce the influence of repeated or highly related materials without eliminating the repetition entirely.

Regarding topic-modeling steps herein e.g. step v of FIG. 1, step 3 of FIG. 2, step 70 of FIGS. 3 a-3 b:

A topic model is a computational functionality analyzing a set of documents and yielding “topics” that occur in the set of documents, typically including (a) what the topics are and (b) what each document's balance of topics is. According to Wikipedia, “Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: “dog” and “bone” will appear more often in documents about dogs, “cat” and “meow” will appear in documents about cats, and “the” and “is” will appear equally in both. A document typically concerns multiple topics in different proportions”. Topic models may analyze large volumes of unlabeled text and each “topic” may consist of a cluster of words that occur together frequently.

Another definition, from the following http location: faculty.washington.edu/jwilker/559/SteyversGriffiths.pdf, is that topic modeling functionality proceeds from an assumption “that documents are mixtures of topics, where a topic is a probability distribution over words. A topic model is a generative model for documents: it specifies a simple probabilistic procedure by which documents can be generated. To make a new document, one chooses a distribution over topics. Then, for each word in that document, one chooses a topic at random according to this distribution, and draws a word from that topic. Standard statistical techniques can be used to invert this process, inferring the set of topics that were responsible for generating a collection of documents.”
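The generative procedure quoted above might be sketched, for illustration only, as follows. The topic names, vocabularies and probabilities are invented for the example, and the sketch covers only the forward (document-generating) direction, not the statistical inversion that a topic modeling package performs.

```python
import random
from typing import Dict, List


def generate_document(topic_mix: Dict[str, float],
                      topics: Dict[str, Dict[str, float]],
                      length: int,
                      rng: random.Random) -> List[str]:
    """Draw `length` words: pick a topic per word, then a word from that topic."""
    topic_names = list(topic_mix)
    topic_weights = [topic_mix[t] for t in topic_names]
    words: List[str] = []
    for _ in range(length):
        # Choose a topic at random according to the document's topic distribution.
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        # Draw a word from that topic's distribution over words.
        vocabulary = list(topics[topic])
        word_weights = [topics[topic][w] for w in vocabulary]
        words.append(rng.choices(vocabulary, weights=word_weights)[0])
    return words


# Invented topics: each is a probability distribution over words.
topics = {
    "cats": {"cat": 0.5, "meow": 0.3, "the": 0.2},
    "dogs": {"dog": 0.5, "bone": 0.3, "the": 0.2},
}
rng = random.Random(0)
mixed_doc = generate_document({"cats": 0.7, "dogs": 0.3}, topics, 20, rng)
cat_only_doc = generate_document({"cats": 1.0, "dogs": 0.0}, topics, 20, rng)
```

A document generated from the 70/30 mixture will tend to contain words from both vocabularies, which is the sense in which a document is a mixture of topics.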

Topic modeling as used herein includes any or all of the above, as well as any computerized functionality which inputs text/s and uses a processor to generate and output a list of semantic topics which the text/s are assumed to pertain to, wherein each “topic” comprises a list of keywords assumed to represent a semantic concept.

Topic modeling includes but is not limited to any and all of: the Topic modeling functionality described by Papadimitriou, Raghavan, Tamaki and Vempala in 1998; Probabilistic latent semantic indexing (PLSI), created by Thomas Hofmann in 1999; Latent Dirichlet allocation (LDA), developed by David Blei, Andrew Ng, and Michael I. Jordan in 2002 and allowing documents to have a mixture of topics; extensions on LDA, such as but not limited to Pachinko allocation; Griffiths & Steyvers Topic modeling e.g. as published in 2002, 2003, 2004; Hofmann Topic modeling e.g. as published in 1999, 2001; topic modeling using the synchronic approach; topic modeling using the diachronic approach; Topic modeling functionality which attempts to fit appropriate model parameters to the data corpus using heuristic/s for maximum likelihood fit; topic modeling functionality with provable guarantees; topic modeling functionality which uses singular value decomposition (SVD); topic modeling functionality which uses the method of moments; topic modeling functionality which uses an algorithm based upon non-negative matrix factorization (NMF); and topic modeling functionality which allows correlations among topics. Topic modeling implementations may for example employ Mallet (software project), Stanford Topic Modeling Toolkit, or GenSim (“Topic Modeling for Humans”).

Earlier presented embodiments are now described for use either independently or in suitable combination with the embodiments described above:

When enhancing expert-based computerized analysis of a set of digital documents, a system for computerized derivation of leads from a huge body of data may be provided, the system comprising:

an electronic repository including a multiplicity of accesses to a respective multiplicity of electronic documents and metadata including metadata parameters having metadata values characterizing each of the multiplicity of electronic documents;

a relevance rater using a processor to run a first computer algorithm on the multiplicity of electronic documents which yields a relevance score which rates relevance of each of the multiplicity of electronic documents to an issue; and

a metadata-based relevant-irrelevant document discriminator using a processor to rapidly run a second computer algorithm on at least some of the metadata which yields leads, each lead comprising at least one metadata value for at least one metadata parameter, which value correlates with relevance of the electronic documents to the issue.

The application is operative to find outliers of a given metadata and relevancy score (i.e. relevant, not relevant). When theme-exploring is used, the system can identify themes with high relevancy score based on the given application. The above system, without theme-exploring, may compute the outlier for a given metadata, and each document appears once in each metadata. In the theme-exploring setting, for a given set of themes the same document might fall into several of the metadata.

Method for Use of Themes in e-Discovery (FIG. 1):

Step i. Input: a set of electronic documents. The documents could be in: text format, native files (PDF, Word, PPT, etc.), ZIP files, PST, Lotus Notes, MSG, etc.

Step ii. Extract text from the data collection. Text extraction can be done by third party software such as: Oracle inside out, iSys, DTSearch, iFilter, etc.

Step iii. Compute near-duplicates (ND) on the dataset. The following teachings may be used: U.S. Pat. No. 8,015,124, entitled “A Method for Determining Near Duplicate Data Objects”; and/or WO 2007/086059, entitled “Determining Near Duplicate “Noisy” Data Objects”, and/or suitable functionalities in commercially available e-discovery systems such as those of Equivio.

For each document compute the following:

Step iiia: DuplicateSubsetID: all documents having the same DuplicateSubsetID have identical text.

Step iiib: EquiSetID: all documents having the same EquiSetID are similar (for each document x in the set there is another document y in the set, such that the similarity between the two is greater than some threshold).

Step iiic: Pivot: 1 if the document is a representative of the set (and 0 otherwise). Typically, for each EquiSet only one document is selected as Pivot. The pivot document can be selected by a policy, for example maximum number of words, minimum number of words, median number of words, minimum docid, etc. When using theme networking (TN) it is recommended to use maximum number of words as the pivot policy, as it is desirable for the largest documents to be in the model.

Step iv. Compute email threads (ET) on the dataset. The following teachings may be used: WO 2009/004324, entitled “A Method for Organizing Large Numbers of Documents” and/or suitable functionalities in commercially available e-discovery systems such as those of Equivio. The output of this phase is a collection of trees, and all leaves of the trees are marked as inclusive. Note that family information is accepted (to group e-mails with their attachments).

Step v. Run a topic modeling algorithm (such as LDA) on a subset of the dataset, including feature extraction. Resulting topics are defined as themes. The subset includes the following documents:

-   Inclusives from email threads (ET)
-   Pivots from all documents that are not e-mails, i.e. pivots from documents and attachments.

The data collection thus includes fewer files (usually about 50% of the total size), and the data do not include similar documents; therefore a document that appears many times in the original data collection has the same weight as if it appeared once.
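The pivot policies mentioned in step iiic might be sketched as follows, assuming each document is described by a hypothetical (docid, EquiSetID, word count) tuple. Only three of the named policies are shown; the median policy is omitted for brevity.

```python
from typing import Callable, Dict, List, Tuple

DocInfo = Tuple[int, int, int]  # (docid, equiset_id, word_count), an assumed layout

# Each policy maps a document to a sort key; the smallest key wins the pivot.
POLICIES: Dict[str, Callable[[DocInfo], tuple]] = {
    "max_words": lambda d: (-d[2], d[0]),  # most words; lowest docid breaks ties
    "min_words": lambda d: (d[2], d[0]),   # fewest words; lowest docid breaks ties
    "min_docid": lambda d: (d[0],),
}


def select_pivots(docs: List[DocInfo], policy: str = "max_words") -> Dict[int, int]:
    """Return {equiset_id: pivot docid}, one pivot per EquiSet."""
    key = POLICIES[policy]
    best: Dict[int, DocInfo] = {}
    for doc in docs:
        current = best.get(doc[1])
        if current is None or key(doc) < key(current):
            best[doc[1]] = doc
    return {equiset: doc[0] for equiset, doc in best.items()}


example_docs = [(1, 100, 50), (2, 100, 80), (3, 200, 10)]
pivots_by_max_words = select_pivots(example_docs)
```

With the default policy the longest document of each EquiSet becomes its pivot, in line with the recommendation above for theme networking.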

In building the model, only documents with more than 25 words (parameter) and fewer than 20,000 words were used. The idea behind this limitation was to improve performance, and to avoid being influenced by high word frequency when the document has few features.

If the dataset is extremely large, at most 100,000 (parameter) documents may be selected at random to build the model, and after building the model, it may be applied to all other documents.

The first step in the topic modeling algorithm is to extract features from each document.

A method suitable for the feature extraction of step v may include obtaining features as follows:

A topic modeling algorithm uses features to create the model for the topic-modeling step v above. The features are words; to generate a list of words from each document one may do the following:

If the document is an e-mail, remove all e-mail headers in the document, but keep the subject line and the body. One may multiply the subject line to set some weight to the subject words. Tokenize the text using separators such as spaces, semicolons, colons, tabs, new lines etc. Ignore the following features:

-   Words with length less than 3 (parameter)
-   Words with length greater than 20 (parameter)
-   Words that do not start with an alpha character
-   Words that are stop words
-   Words that appear more than 0.2 times number of words in the document (parameter)
-   Words that appear in less than 0.01 times number of documents (parameter)
-   Words that appear in more than 0.2 times number of documents (parameter)

Stemming and part-of-speech may also be used as features.

Step viii. Theme names. The output of step v includes an assignment of documents to the themes, and an assignment of words (features) to themes. Each feature x has some probability P_xy of being in theme y. Using the P matrix, construct names for the themes.
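The feature extraction rules of step v (header removal, subject weighting, tokenization and the ignore list) could be sketched as follows. The header pattern, the stop word list, lower-casing of tokens and the default subject weight of 3 are assumptions made for this sketch; the numeric thresholds mirror the parameters listed above.

```python
import re
from collections import Counter
from typing import List

STOP_WORDS = {"the", "and", "is", "for", "with"}  # tiny illustrative list
HEADER_RE = re.compile(r"^(from|to|cc|bcc|sent|date):", re.IGNORECASE)
SUBJECT_RE = re.compile(r"^subject:\s*(.*)$", re.IGNORECASE)


def email_to_text(raw: str, subject_weight: int = 3) -> str:
    """Drop e-mail headers, keep the body, repeat the first subject line."""
    subject, body = None, []
    for line in raw.splitlines():
        line = line.strip()
        match = SUBJECT_RE.match(line)
        if match:
            subject = subject if subject is not None else match.group(1)
        elif line and not HEADER_RE.match(line):
            body.append(line)
    return " ".join(([subject] * subject_weight if subject else []) + body)


def tokenize(text: str) -> List[str]:
    # Separators: whitespace (spaces, tabs, new lines), semicolons, colons, commas.
    return [t for t in re.split(r"[\s;:,]+", text.lower()) if t]


def document_features(text: str, min_len: int = 3, max_len: int = 20,
                      max_in_doc_ratio: float = 0.2) -> List[str]:
    tokens = tokenize(text)
    counts = Counter(tokens)
    return [t for t in tokens
            if min_len <= len(t) <= max_len      # length limits
            and t[0].isalpha()                   # must start with an alpha character
            and t not in STOP_WORDS              # not a stop word
            and counts[t] <= max_in_doc_ratio * len(tokens)]  # not too frequent in document


def filter_by_document_frequency(docs: List[List[str]], min_ratio: float = 0.01,
                                 max_ratio: float = 0.2) -> List[List[str]]:
    """Drop words appearing in too few or too many documents of the collection."""
    n = len(docs)
    doc_freq = Counter(w for d in docs for w in set(d))
    return [[w for w in d if min_ratio * n <= doc_freq[w] <= max_ratio * n] for d in docs]
```

The per-document filters and the collection-wide filter are kept separate because the latter needs document frequencies that are only known once every document has been tokenized.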

In e-discovery one may use the following scenarios: Early Case Assessment, Post Case Assessment and provision of helpful User Interfaces.

Early Case Assessment (FIG. 2, Including Some or all of the Following Steps a-h):

a. Select at random 100,000 documents

b. Run near-duplicates (ND)

c. Run Email threads (ET)

d. Select pivot and inclusive

e. Run topic modeling using the above feature selection. The input of the topic modeling is a set of documents. The first phase of the topic modeling is to construct a set of features for each document. The feature getting method described above may be used to construct the set of features.

f. Run the model on all other documents (optional).

g. Generate theme names e.g. using step viii above.

h. Explore the data by browsing themes; one may open a list of documents belonging to a certain theme, from the document one may see all themes connected to that document, and go to other themes.

The list of documents might be filtered by a condition set by the user. For example filter all documents by dates, relevancy, file size, etc.

The above procedure assists users in early case assessment when the data is known and one would like to know what is in the data, and assess the contents of the data collection.

In early case assessment one may randomly sample the dataset to get results faster.

Post Case Assessment: This process uses some or all of steps i-v above, but in this setting an entire dataset is not used, but rather, only the documents that are relevant to the case. If near-duplicates (ND) and Email threads (ET) have already run, there is no need to re-run them.

1st pass review is a quick review of the documents that can be handled manually or by automatic predictive coding software; the user wishes to review the results and get an idea of the themes of the documents that passed that review. This phase is essential because the number of such documents might be extremely large. Also, there are cases in which, in some sub-issues, there are only a few documents.

The above building block can generate a procedure for such cases. Here, only documents that passed the 1st review phase are taken, and themes are calculated for them.

User Interface using the output of steps i-v and displaying results thereof: Upon running the topic modeling each resulting topic is defined as a theme, and for each theme the list of documents that are related to that theme is displayed. The user has an option to select a meta-data (for example is the document relevant to an issue, custodian, date-range, file type, etc.) and the system will display for each theme the percentage of documents in that theme which match the selected meta-data. Such presentation would assist the user while evaluating the theme.
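The per-theme percentage display described above might be computed along the following lines. The data layout (a list of document identifiers per theme and a metadata dictionary per document) and the function name are assumptions of this sketch.

```python
from typing import Dict, List


def metadata_share_by_theme(theme_docs: Dict[str, List[int]],
                            doc_metadata: Dict[int, Dict[str, str]],
                            field: str, value: str) -> Dict[str, float]:
    """Percentage of each theme's documents whose metadata field equals value."""
    shares: Dict[str, float] = {}
    for theme, docs in theme_docs.items():
        hits = sum(1 for d in docs if doc_metadata.get(d, {}).get(field) == value)
        shares[theme] = 100.0 * hits / len(docs) if docs else 0.0
    return shares


# Illustrative data: a document may belong to more than one theme.
theme_docs = {"t1": [1, 2, 3, 4], "t2": [3, 4]}
doc_metadata = {
    1: {"custodian": "C"},
    2: {"custodian": "D"},
    3: {"custodian": "C"},
    4: {"custodian": "C"},
}
shares = metadata_share_by_theme(theme_docs, doc_metadata, "custodian", "C")
```

Because documents may belong to several themes, the percentages are computed per theme and need not sum to 100 across themes.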

An LDA model might have themes that can be classified as CAT_related and DOG_related. A theme has probabilities of generating various words, such as milk, meow, and kitten, which can be classified and interpreted by the viewer as “CAT_related”. The word cat itself will have high probability given this theme. The DOG_related theme likewise has probabilities of generating each word: puppy, bark, and bone might have high probability. Words without special relevance, such as “the” (see function word), will have roughly even probability between classes (or can be placed into a separate category). A theme is not strongly defined, neither semantically nor epistemologically. It is identified on the basis of supervised labeling and (manual) pruning on the basis of their likelihood of co-occurrence. A lexical word may occur in several themes with a different probability, however, with a different typical set of neighboring words in each theme.

Each document is assumed to be characterized by a particular set of themes. This is akin to the standard bag of words model assumption, and makes the individual words exchangeable.

Processing a large data set requires time and space. In the context of the current invention, N documents are selected to create the model, and then the model is applied to the remaining documents.

When selecting the documents to build the model, a few options may be possible:

-   O1. Take all documents.
-   O2. Take one document from each set of exact duplicate documents.
-   O3. Take one document from each EquiSet (e.g. as per U.S. Pat. No. 8,015,124, entitled “A Method for Determining Near Duplicate Data Objects”; and/or WO 2007/086059, entitled “Determining Near Duplicate “Noisy” Data Objects”).
-   O4. Take the inclusive from the data collection.

Another option is to randomly sample X documents from the collection, as described above.

Options O2, O3 and O4 aim to create themes that are known to the user, and also not to give extra weight to documents that already appear in a known set.

The input for the algorithm is a text document that can be parsed into a bag-of-words. When processing an e-mail, one may notice that the e-mail contains a header (From, To, CC, Subject) and a body. The body of an e-mail can be formed by a series of e-mails.
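The document-selection options listed above (one document per exact-duplicate set, one document per near-duplicate group, and random sampling of X documents) might be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, exact duplicates are assumed to be detected by hashing the document text, and the near-duplicate grouping (`group_of`) is assumed to have been computed beforehand by a separate near-duplicate detection tool, which is not reproduced here.

```python
import hashlib
import random


def one_per_exact_duplicate_set(docs):
    """Option O2: keep the first document seen for each distinct text."""
    seen = set()
    kept = []
    for doc_id, text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc_id)
    return kept


def one_per_group(doc_ids, group_of):
    """Option O3: keep one document per precomputed near-duplicate group."""
    seen = set()
    kept = []
    for doc_id in doc_ids:
        # A document with no known group is treated as a group of its own.
        group = group_of.get(doc_id, ("solo", doc_id))
        if group not in seen:
            seen.add(group)
            kept.append(doc_id)
    return kept


def sample_documents(doc_ids, x, seed=0):
    """Randomly sample at most x documents; the seed makes the sample repeatable."""
    rng = random.Random(seed)
    return rng.sample(doc_ids, min(x, len(doc_ids)))


# Hypothetical collection of (document id, text) pairs.
docs = [
    ("d1", "hello world"),
    ("d2", "hello world"),
    ("d3", "hello there"),
    ("d4", "goodbye"),
]
unique = one_per_exact_duplicate_set(docs)
reduced = one_per_group(unique, {"d1": "g1", "d3": "g1"})
sample = sample_documents(reduced, 1)
print(unique, reduced, sample)
```

Here pruning is applied before random sampling; the two stages could equally be applied in the opposite order.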

For example, an e-mail whose body contains an earlier e-mail may look as follows:

From: A
To: B
Subject: CCCCC
Body1 Body1
    From: B
    To: A
    Subject: CCCCC
    Body2 Body2

While processing e-mails for topic modeling, one can consider removing all e-mail headers within the body, and weighting the subject by repeating the subject line several times. In the above example the processed text would be:

CCCCC CCCCC CCCCC Body1 Body1 Body2 Body2

Step viii (Theme names) is now described in detail:

Let P(w_i,t_j) be the probability that the feature w_i belongs to theme t_j. In known implementations the theme name is a list of the words with the highest probability. This solution is good when the dataset is sparse, i.e. the vocabularies of the themes differ from each other. In e-discovery the issues are highly connected and therefore there are cases in which the “top” words appear in two or more themes. In this setting of the problem, a “stable marriage” approach was used as an algorithm to pair words to themes. The algorithm may include:

Order the themes by some criterion (size, quality, number of relevant documents, etc.), e.g. such that theme_3 is better than theme_4.

(1) Create an empty set S.
(2) Sort the themes by some criterion.
(3) For j = 0; j < maximum number of words in a theme name; j++
(4)     For i = 0; i < number of themes; i++ do
(5)         For theme_i, assign the word with the highest score that is not in S, and add that word to S.

After X words are assigned for each theme, the number of words can be reduced by, for example, taking only those words in each theme whose rank is greater than the maximum word rank in that theme divided by some constant.
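The theme-naming steps (1)-(5) and the subsequent reduction might be sketched as follows. This sketch follows the listed round-robin steps rather than a full stable-marriage matching. The function name and sample scores are hypothetical, and two assumptions are made: the “maximum word rank in that theme” is taken to be the highest score among all words of that theme, and a word is kept only if its score is strictly greater than that maximum divided by the constant.

```python
def name_themes(word_scores, theme_order, max_words, constant=None):
    """Assign distinct name words to themes.

    word_scores: {theme: {word: score}}, e.g. the probability P(w_i, t_j).
    theme_order: themes sorted best first by the chosen criterion.
    """
    used = set()  # the set S of words already assigned to some theme
    names = {theme: [] for theme in theme_order}
    for _ in range(max_words):  # one round per position in the theme name
        for theme in theme_order:  # better themes choose first in each round
            candidates = [w for w in word_scores[theme] if w not in used]
            if not candidates:
                continue
            best = max(candidates, key=lambda w: word_scores[theme][w])
            names[theme].append(best)
            used.add(best)
    if constant is not None:
        # Optional reduction: drop name words scoring too far below the
        # theme's maximum word score (assumed interpretation of the text).
        for theme in theme_order:
            cutoff = max(word_scores[theme].values()) / constant
            names[theme] = [w for w in names[theme] if word_scores[theme][w] > cutoff]
    return names


# Hypothetical scores in which "contract" is a top word of both themes.
word_scores = {
    "theme_1": {"contract": 0.30, "price": 0.25, "merger": 0.05},
    "theme_2": {"contract": 0.28, "price": 0.10, "lawsuit": 0.20, "meeting": 0.03},
}
order = ["theme_1", "theme_2"]
full_names = name_themes(word_scores, order, max_words=2)
pruned_names = name_themes(word_scores, order, max_words=2, constant=2)
print(full_names)
print(pruned_names)
```

In this example the shared top word is assigned only to the better-ranked theme, so that the two theme names do not overlap.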

Typically, electronic documents do not bear, or do not need to bear, any pre-annotation or labeling or meta-data, or if they do, such is not employed by the topic modeling which instead is derived by analyzing the actual texts.
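As an illustration of preparing the actual text for topic modeling, the e-mail preprocessing described above (removing header lines, including those of e-mails embedded in the body, and repeating the subject line as a weight) might be sketched as follows. The function name, the regular expression and the default weight of three are assumptions made for this sketch; real e-mail formats, reply prefixes such as “RE:” and multi-line headers would need additional handling.

```python
import re

# Header lines recognised by this sketch; real mail may use further fields.
HEADER_RE = re.compile(r"^(from|to|cc|bcc|subject)\s*:\s*(.*)$", re.IGNORECASE)


def email_to_model_text(raw, subject_weight=3):
    """Drop header lines anywhere in the e-mail and repeat the first subject."""
    subject = None
    body_lines = []
    for line in raw.splitlines():
        stripped = line.strip()
        match = HEADER_RE.match(stripped)
        if match:
            # Remember only the first subject encountered; discard the header line.
            if match.group(1).lower() == "subject" and subject is None:
                subject = match.group(2).strip()
            continue
        if stripped:
            body_lines.append(stripped)
    weighted_subject = [subject] * subject_weight if subject else []
    return " ".join(weighted_subject + body_lines)


raw_email = (
    "From: A\n"
    "To: B\n"
    "Subject: CCCCC\n"
    "Body1 Body1\n"
    "    From: B\n"
    "    To: A\n"
    "    Subject: CCCCC\n"
    "    Body2 Body2\n"
)
processed = email_to_model_text(raw_email)
print(processed)
```

The resulting string can then be parsed into a bag-of-words in the same manner as any other text document.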

A particular advantage of certain embodiments of the invention is that collections of electronic documents are hereby analyzed semantically by a processor on a scale that would be impossible manually. Output of topic modeling may include the n most frequent words from the m most frequent topics found in an individual document.
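For example, given a fitted model, the n top words of the m top topics of a document might be extracted as in the following sketch. The function name and the probabilities are hypothetical, and “most frequent” is interpreted here, as an assumption, as “having the highest probability” in the document's topic distribution and in each topic's word distribution.

```python
def top_words_for_document(doc_topic_probs, topic_word_probs, m, n):
    """Return {topic: [n highest-probability words]} for the document's m top topics."""
    top_topics = sorted(doc_topic_probs, key=doc_topic_probs.get, reverse=True)[:m]
    return {
        topic: sorted(
            topic_word_probs[topic], key=topic_word_probs[topic].get, reverse=True
        )[:n]
        for topic in top_topics
    }


# Hypothetical distributions for one document and three topics.
doc_topic_probs = {"CAT_related": 0.7, "DOG_related": 0.2, "other": 0.1}
topic_word_probs = {
    "CAT_related": {"cat": 0.4, "kitten": 0.3, "milk": 0.2, "meow": 0.1},
    "DOG_related": {"puppy": 0.5, "bark": 0.3, "bone": 0.2},
    "other": {"the": 0.9, "of": 0.1},
}
summary = top_words_for_document(doc_topic_probs, topic_word_probs, m=2, n=2)
print(summary)
```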

It is appreciated that when presenting documents, it need not be the case that all documents whose document score for at least one user-selected theme is high in a defined sense, e.g. over a certain threshold, are displayed. Similarly, it need not be the case that all documents whose distributions over the plurality of themes are similar in a defined sense to the distribution of a user-selected document over the plurality of themes are displayed. Instead, only a subset of the documents may be displayed, e.g. only such documents as answer at least one individual criterion. So, for example, a search engine could be used on the data collection, and then results of the search query might be presented using the embodiments shown and described herein. Alternatively or in addition, a predicate may be used as a criterion, e.g. presenting only documents in English or only documents relevant to a given issue.
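The two retrieval modes described above, optionally restricted by a predicate, might be sketched as follows. The function names and sample scores are hypothetical. “High in a defined sense” is taken here to mean a score above a threshold, and “similar in a defined sense” is taken, purely as an assumption, to mean cosine similarity at or above a chosen minimum; other similarity measures could be substituted.

```python
import math


def documents_for_theme(doc_theme_scores, theme, threshold, predicate=lambda d: True):
    """Documents whose score for the theme exceeds the threshold, best first."""
    hits = [
        (doc_id, scores.get(theme, 0.0))
        for doc_id, scores in doc_theme_scores.items()
        if scores.get(theme, 0.0) > threshold and predicate(doc_id)
    ]
    return [doc_id for doc_id, _ in sorted(hits, key=lambda h: h[1], reverse=True)]


def cosine(p, q):
    """Cosine similarity between two sparse theme distributions."""
    dot = sum(p.get(t, 0.0) * q.get(t, 0.0) for t in set(p) | set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


def similar_documents(doc_theme_scores, selected, min_similarity, predicate=lambda d: True):
    """Documents whose theme distribution resembles that of the selected document."""
    target = doc_theme_scores[selected]
    return [
        doc_id
        for doc_id, scores in doc_theme_scores.items()
        if doc_id != selected
        and predicate(doc_id)
        and cosine(target, scores) >= min_similarity
    ]


# Hypothetical document-to-theme scores and an "is in English" predicate.
doc_theme_scores = {
    "d1": {"t1": 0.9, "t2": 0.1},
    "d2": {"t1": 0.8, "t2": 0.2},
    "d3": {"t1": 0.1, "t2": 0.9},
}
english = {"d1", "d3"}
by_theme = documents_for_theme(doc_theme_scores, "t1", 0.5)
by_theme_english = documents_for_theme(
    doc_theme_scores, "t1", 0.5, predicate=lambda d: d in english
)
neighbours = similar_documents(doc_theme_scores, "d1", 0.95)
print(by_theme, by_theme_english, neighbours)
```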

The methods shown and described herein are particularly useful in processing or analyzing or sorting or searching bodies of knowledge including hundreds, thousands, tens of thousands, or hundreds of thousands of electronic documents or other computerized information repositories, some or many of which are themselves at least tens or hundreds or even thousands of pages long. This is because practically speaking, such large bodies of knowledge can only be processed, analyzed, sorted, or searched using computerized technology.

It is appreciated that terminology such as “mandatory”, “required”, “need” and “must” refer to implementation choices made within the context of a particular implementation or application described herewithin for clarity and are not intended to be limiting since in an alternative implementation, the same elements might be defined as not mandatory and not required or might even be eliminated altogether.

It is appreciated that software components of the present invention including programs and data may, if desired, be implemented in ROM (read only memory) form including CD-ROMs, EPROMs and EEPROMs, or may be stored in any other suitable typically non-transitory computer-readable medium such as but not limited to disks of various kinds, cards of various kinds and RAMs. Components described herein as software may, alternatively, be implemented wholly or partly in hardware and/or firmware, if desired, using conventional techniques, and vice-versa. Each module or component may be centralized in a single location or distributed over several locations.

Included in the scope of the present invention, inter alia, are electromagnetic signals carrying computer-readable instructions for performing any or all of the steps or operations of any of the methods shown and described herein, in any suitable order including simultaneous performance of suitable groups of steps as appropriate; machine-readable instructions for performing any or all of the steps of any of the methods shown and described herein, in any suitable order; program storage devices readable by machine, tangibly embodying a program of instructions executable by the machine to perform any or all of the steps of any of the methods shown and described herein, in any suitable order; a computer program product comprising a computer useable medium having computer readable program code, such as executable code, having embodied therein, and/or including computer readable program code for performing, any or all of the steps of any of the methods shown and described herein, in any suitable order; any technical effects brought about by any or all of the steps of any of the methods shown and described herein, when performed in any suitable order; any suitable apparatus or device or combination of such, programmed to perform, alone or in combination, any or all of the steps of any of the methods shown and described herein, in any suitable order; electronic devices each including a processor and a cooperating input device and/or output device and operative to perform in software any steps shown and described herein; information storage devices or physical records, such as disks or hard drives, causing a computer or other device to be configured so as to carry out any or all of the steps of any of the methods shown and described herein, in any suitable order; a program pre-stored e.g. in memory or on an information network such as the Internet, before or after being downloaded, which embodies any or all of the steps of any of the methods shown and described herein, in any suitable order, and the method of uploading or downloading such, and a system including server/s and/or client/s for using such; a processor configured to perform any combination of the described steps or to execute any combination of the described modules; and hardware which performs any or all of the steps of any of the methods shown and described herein, in any suitable order, either alone or in conjunction with software. Any computer-readable or machine-readable media described herein is intended to include non-transitory computer- or machine-readable media.

Any computations or other forms of analysis described herein may be performed by a suitable computerized method. Any step described herein may be computer-implemented. The invention shown and described herein may include (a) using a computerized method to identify a solution to any of the problems or for any of the objectives described herein, the solution optionally includes at least one of a decision, an action, a product, a service or any other information described herein that impacts, in a positive manner, a problem or objectives described herein; and (b) outputting the solution.

The system may, if desired, be implemented as a web-based system employing software, computers, routers and telecommunication equipment as appropriate.

Any suitable deployment may be employed to provide functionalities e.g. software functionalities shown and described herein. For example, a server may store certain applications, for download to clients, which are executed at the client side, the server side serving only as a storehouse. Some or all functionalities e.g. software functionalities shown and described herein may be deployed in a cloud environment. Clients e.g. mobile communication devices such as smartphones may be operatively associated with, but external to the cloud.

The scope of the present invention is not limited to structures and functions specifically described herein and is also intended to include devices which have the capacity to yield a structure, or perform a function, described herein, such that even though users of the device may not use the capacity, they are, if they so desire, able to modify the device to obtain the structure or function.

Features of the present invention which are described in the context of separate embodiments may also be provided in combination in a single embodiment.

For example, a system embodiment is intended to include a corresponding process embodiment. Also, each system embodiment is intended to include a server-centered “view” or client centered “view”, or “view” from any other node of the system, of the entire functionality of the system, computer-readable medium, apparatus, including only those functionalities performed at that server or client or node.

Conversely, features of the invention, including method steps, which are described for brevity in the context of a single embodiment or in a certain order may be provided separately or in any suitable subcombination or in a different order. “e.g.” is used herein in the sense of a specific example which is not intended to be limiting. Devices, apparatus or systems shown coupled in any of the drawings may in fact be integrated into a single platform in certain embodiments or may be coupled via any appropriate wired or wireless coupling such as but not limited to optical fiber, Ethernet, Wireless LAN, HomePNA, power line communication, cell phone, PDA, Blackberry GPRS, Satellite including GPS, or other mobile delivery. It is appreciated that in the description and drawings shown and described herein, functionalities described or illustrated as systems and sub-units thereof can also be provided as methods and steps therewithin, and functionalities described or illustrated as methods and steps therewithin can also be provided as systems and sub-units thereof. The scale used to illustrate various elements in the drawings is merely exemplary and/or appropriate for clarity of presentation and is not intended to be limiting.

1. A method for computerized identification of themes in a large data set, the system comprising: reducing the number of data set members in a large data set, using at least one computerized data set member pruning technique other than random selection; and using a computerized theme identification technique for identifying a plurality of themes in the reduced data set.
 2. A method according to claim 1 wherein said computerized data set member pruning technique comprises thinning out at least one document which passes a document similarity criterion relative to at least one other document not being thinned out, thereby to combat skewing as a result of over-influence of similar, hence over-represented, documents upon said theme identification technique.
 3. A method according to claim 2 wherein said thinning out at least one document which passes a document similarity criterion comprises replacing a plurality of emails forming an email thread, with at least one inclusive email, thereby to thin out emails which are included in said inclusive email hence are deemed to pass the document similarity criterion with regard to said inclusive.
 4. A method according to claim 2 wherein said thinning out at least one document which passes a document similarity criterion comprises identifying and discarding near-duplicates thereby to thin out at least one document which is deemed to pass the document similarity criterion with regard to a set of near-duplicates of said document, at least one of which is not being thinned out.
 5. A method according to claim 1 wherein said computerized theme identification technique comprises topic modeling.
 6. A method according to claim 5 wherein said topic modeling allows documents to have a plurality of topics.
 7. A browsing system operative in conjunction with a stored representation of a multiplicity of electronic documents and their distribution over a plurality of themes, the system comprising: theme-to-document flitting apparatus for retrieving and presenting to a user, documents whose document score for at least one user-selected theme; is high; and document-level browsing apparatus for retrieving and presenting to a user, documents whose distributions over the plurality of themes are similar to the distribution of a user-selected document over the plurality of themes.
 8. A system according to claim 7 and also comprising: theme-to-word flitting apparatus for retrieving and presenting to a user, words whose word score for at least one user-selected theme; is high; word-level browsing apparatus for retrieving and presenting to a user, words whose distributions over the plurality of themes are similar to the distribution of a user-selected word over the plurality of themes, thereby to provide 3-tier browsing apparatus facilitating browsing at word, document and topic levels responsive to user-initiated flitting between the levels.
 9. A method according to claim 1 and also comprising: facilitating theme-to-word flitting by retrieving and presenting to a user, words whose word score for at least one user-selected theme; is high.
 10. A method according to claim 1 and also comprising: facilitating theme-to-document flitting for retrieving and presenting to a user, documents whose document score for at least one user-selected theme is high.
 11. A method according to claim 1 and also comprising: facilitating document-level browsing for retrieving and presenting to a user, documents whose distributions over the plurality of themes are similar to the distribution of a user-selected document over the plurality of themes.
 12. A method according to claim 1 and also comprising: facilitating word-level browsing for retrieving and presenting to a user, words whose distributions over the plurality of themes are similar to the distribution of a user-selected word over the plurality of themes.
 13. A method according to claim 1 wherein the number of data set members in the large data set is further reduced subsequent to said using step and prior to a manual review process.
 14. A method according to claim 1 wherein said reducing is effected using: random selection; and at least one computerized data set member pruning technique other than random selection.
 15. A method according to claim 14 wherein said random selection is performed after said computerized data set member pruning technique.
 16. A method according to claim 14 wherein said random selection is performed before said computerized data set member pruning technique.
 17. A method according to claim 5 wherein said topic modeling which allows documents to have a plurality of topics comprises one of the following computerized techniques: Latent Dirichlet allocation (LDA), PLSI, and Pachinko allocation.
 18. A method according to claim 3 wherein said thinning out at least one document which passes a document similarity criterion comprises replacing a plurality of emails forming an email thread, with a single inclusive email.
 19. A method according to claim 4 wherein said identifying and discarding near-duplicates is effected using Equivio Zoom near-duplicate functionality.
 20. A method according to claim 9 and wherein said facilitating comprises retrieving and presenting to a user, only those words whose word score for at least one user-selected theme; is high and which answer to at least one additional criterion. 